7 Random Variables and Distributions
So far in this section, we’ve talked about probabilities, different ways of thinking about probabilities, and a bit about how to work with probabilities. In this section we will introduce a more formal framework for how to think about and handle uncertain events. By making a few assumptions about how things behave, we can calculate probabilities of events without observing them.
7.1 Random Variables
A random variable is a variable whose value is not guaranteed in advance, but can take different values.
7.1.1 Examples: random variables
Define a variable \(X\) to be the outcome of a coin flip. Now, before we flip the coin, we do not know what value \(X\) will take on – it could be “heads” or it could be “tails”. Once we flip the coin and observe the outcome, we say that we have a realization of the random variable \(X\).
Another example would be if we let \(Y\) be the height of a randomly chosen US adult. We don’t know exactly what value it is, but we do know a few things about it. For example, it is much more likely to be around 5.5ft than it is to be around 7ft or below 4ft. When we do finally randomly select a US adult, and measure their height, we get a realization of this random variable.
A third example is if we let \(Z\) denote the diabetes status of a randomly chosen US adult. This could take the values healthy, type I, or type II, and each will happen with some probability.
As the examples above were meant to illustrate, a random variable can really be anything you’d like. And whenever we talk about a random variable, we also talk about the probability of certain outcomes. If we can define a way to calculate probabilities of different outcomes of the random variable, we call this the distribution of the random variable.
Recall that we previously talked about two kinds of variables: discrete and continuous. Likewise, we can consider both discrete and continuous random variables. Depending on the kind of random variable we’re discussing, defining its distribution is handled slightly differently. For a discrete random variable, the distribution is defined by specifying the probability of every single possible outcome. There are two things that are important to remember:
- all probabilities must be between \(0\) and \(1\),
- the sum of all probabilities must add up to \(1\).
The second point above is important, and is sometimes super handy when trying to calculate probabilities of certain complicated events. The intuition behind it is pretty simple: something must happen. So the probability that something happens is \(1\).
7.1.2 Examples: discrete distributions
Consider \(X\) the outcome of a coin flip. The outcome of this can be one of two things: heads or tails. Now, let us pretend that this particular coin is NOT fair, i.e. it is not 50/50. Maybe the probability of getting tails is 0.4. Maybe the probability of getting heads is 0.1. For now, let the probability of getting heads be \(p\), some number between 0 and 1. Then the probability of \(X\) coming up as heads is \(p\), and the probability it comes up as tails is \(1-p\), since the sum of all probabilities has to be \(1\). We write \(P(X = \text{heads}) = p\) and \(P(X = \text{tails}) = 1-p\). This is the distribution of \(X\). In this case, all we need is the probability of the two outcomes.
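To make the coin example concrete, here is a minimal sketch in Python. The value \(p = 0.3\), the seed, and the helper names `coin_distribution` and `is_valid_distribution` are all illustrative choices, not something prescribed by the text. It writes down the distribution of \(X\), checks the two requirements listed above, and then draws realizations of \(X\).

```python
import math
import random

def coin_distribution(p):
    """Distribution of an unfair coin: P(heads) = p, P(tails) = 1 - p."""
    return {"heads": p, "tails": 1.0 - p}

def is_valid_distribution(dist):
    """Check the two requirements: each probability in [0, 1], and a total of 1."""
    probs = list(dist.values())
    return all(0.0 <= q <= 1.0 for q in probs) and math.isclose(sum(probs), 1.0)

dist = coin_distribution(0.3)   # p = 0.3 is an arbitrary illustrative choice
print(dist, is_valid_distribution(dist))

# Draw realizations of X and compare the observed share of heads with p.
rng = random.Random(1)          # fixed seed so the run is repeatable
flips = rng.choices(list(dist), weights=list(dist.values()), k=10_000)
print(flips.count("heads") / len(flips))
```

The observed share of heads should land close to \(p\), but not exactly on it, since each run only gives a finite number of realizations.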
Another example: let \(X\) be the marital status of a randomly chosen participant from the SHOW data.
The examples above all consider discrete random variables. As already mentioned, the approach for continuous random variables is a bit different. For the distribution of a continuous random variable, we need to specify a curve for which the area under the curve is \(1\). When we talk about probabilities of events that relate to the corresponding random variable, we talk about areas under the curve.
7.1.3 Examples: continuous distributions

7.2 Properties of Random Variables
When we talk about random variables, there is a great deal of uncertainty involved, since (by design) we do not know exactly what values the random variables will take after a conducted experiment. Similarly, we cannot be sure that repeating an experiment results in the same outcomes of the random variables simply since they are, as the name strongly implies, random. However, if we have some information about the random variable we’re interested in, we can talk about some very important features of the random variable. The two we will talk about here are the expected value and variance/standard deviation of random variables.
These two concepts can be a bit hard to wrap one’s head around at first, but as we talk about them over and over again, hopefully you will realize that they are not as abstract as they might first seem.
7.2.1 Expected Value of Random Variables
The expected value of a random variable is, intuitively, the long run average. I.e. if we repeat an experiment an infinite number of times, we can determine the expected value of a random variable as the average of all the realizations of said random variable. As an example, if we consider the random variable \(X\) that is \(0\) if a coin flip comes up heads, and \(1\) if it comes up tails, we can imagine flipping a coin an infinite number of times, and calculating the average. The result would be that the expected value of \(X\) is \(0.5\). We write \(E(X) = 0.5\).
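We cannot flip a coin an infinite number of times, but a computer can flip it many times. The sketch below (the seed and the helper name `sample_average` are arbitrary choices) computes the average of more and more realizations of \(X\), coded \(0\) for heads and \(1\) for tails.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

def sample_average(n_flips):
    """Average of n_flips realizations of X (0 = heads, 1 = tails) for a fair coin."""
    flips = [rng.randint(0, 1) for _ in range(n_flips)]
    return sum(flips) / n_flips

for n in (10, 1_000, 100_000):
    print(n, sample_average(n))
```

With only \(10\) flips the average can be far from \(0.5\); with \(100000\) flips it should be very close.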
Since the expected value can be thought of as the long run average, it is in some sense the value that the outcomes of the random variable are going to be centered around.
Note: the expected value is also often referred to as the mean value.
For any discrete random variable where we know the distribution, we can find the expected value in the following way: \(E(X) = x_1 \cdot P(X = x_1) + ... + x_n \cdot P(X = x_n) = \sum_{i=1}^n x_i P(X = x_i)\).
7.2.1.1 Example: expected value of discrete random variable
Let \(X\) be a discrete random variable that can take the values \(1,2,6\), and \(12\). Let the probabilities of each outcome be as follows:
| x | P(X = x) |
|---|---|
| 1 | 0.2 |
| 2 | 0.1 |
| 6 | 0.6 |
| 12 | 0.1 |
Then we can calculate the expected value of \(X\):
\[ \begin{align} E(X) &= \sum_{i = 1}^4 x_i P(X = x_i) \\ &= 1 \cdot P(X = 1) + 2 \cdot P(X = 2) + 6 \cdot P(X = 6) + 12 \cdot P(X = 12) \\ &= 1\cdot 0.2 + 2\cdot 0.1 + 6 \cdot 0.6 + 12 \cdot 0.1 \\ &= 5.2. \end{align} \]
So what does this mean? It means that if we perform an experiment that results in a realization of the random variable \(X\) many, many, many times, the average of all outcomes is going to be close to \(5.2\).
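The calculation and its interpretation can both be checked with a few lines of Python. This is only a sketch: the function name `expected_value`, the seed, and the number of draws are illustrative choices.

```python
import random

# Distribution from the table above: value -> probability.
dist = {1: 0.2, 2: 0.1, 6: 0.6, 12: 0.1}

def expected_value(dist):
    """Sum of each value times its probability."""
    return sum(x * p for x, p in dist.items())

e_x = expected_value(dist)
print(e_x)  # 5.2, up to floating-point rounding

# Long-run average: draw many realizations of X and average them.
rng = random.Random(7)
draws = rng.choices(list(dist), weights=list(dist.values()), k=200_000)
print(sum(draws) / len(draws))
```

The second number is an average of observed outcomes, so it changes with the seed, but it should stay close to \(5.2\).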
7.2.1.2 Example: expected value of a continuous random variable
In the continuous case, actually calculating the expected value isn’t as easy as in the discrete case. Remember, when we specify a discrete distribution, we specify the probability of each possible outcome. When we specify a continuous distribution, we specify a curve over all the possible outcomes, and probabilities of specific events correspond to areas under the curve. This also means that we cannot directly use the sum formula introduced for the discrete case above. Fortunately, the intuition is the same: the expected value is the long run average.
7.2.1.3 Rules for working with expected values
Sometimes, it is very beneficial to be able to transform a random variable, or combine several random variables, into a new one, and work with that new random variable. Fortunately, dealing with the expected value of a large number of such transformations is pretty simple.
First, let’s imagine we have a random variable \(X\) with mean \(E(X)\), and another random variable \(Y\) with mean \(E(Y)\). Perhaps we are interested in the sum of the two, so we construct a new random variable \(Z = X + Y\). Finding the expected value of \(Z\) is really simple: \(E(Z) = E(X + Y) = E(X) + E(Y)\). In words: the expected value of a sum of random variables is simply the sum of expected values.
Another example: maybe we want to scale the outcome of the random variable \(X\) by a constant \(a\), and then consider the new random variable \(Y = a\cdot X\). Again, finding the expected value of the new random variable \(Y\) is really simple: \(E(Y) = E(aX) = a\cdot E(X)\).
One final thing I want to mention here: the expected value of a constant will always be the constant itself. Hopefully, this doesn’t come as too much of a shock. The expected value is what we would expect from a random variable. If something is constant, it means it never changes, so we expect it to stay the same. So, if \(a\) is a constant, \(E(a) = a\). This can be combined with the first rule we talked about to give us that \(E(X + a) = E(X) + E(a) = E(X) + a\).
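The three rules can be verified exactly on a small example. The joint distribution below is made up purely for illustration, and \(X\) and \(Y\) are deliberately chosen to be dependent, since none of these rules needs independence. Fractions are used so that the comparisons are exact.

```python
from fractions import Fraction as F

# A made-up joint distribution of (X, Y); X and Y are deliberately NOT independent.
joint = {(0, 1): F(1, 2), (1, 1): F(1, 4), (1, 3): F(1, 4)}

def expect(g):
    """Expected value of g(X, Y) under the joint distribution above."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
a = 3  # an arbitrary constant

print(expect(lambda x, y: x + y) == e_x + e_y)   # sum rule: True
print(expect(lambda x, y: a * x) == a * e_x)     # scaling rule: True
print(expect(lambda x, y: x + a) == e_x + a)     # adding a constant: True
```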
7.2.2 Variance/Standard Deviation of Random Variables
Where the expected value of a random variable tells us something about where the outcomes of the random variable tend to be located, the next measures we’ll be looking at tell us something about how spread out the outcomes will be around the expected value.
Note: most textbooks handle the variance and standard deviations as two distinct things. I don’t like that. They are virtually two sides of the same coin, and I will deliberately handle the two at the same time. My reasoning for this is that, at least in my head, these two measures try to convey the same message, but to two different audiences. I will elaborate on this later, but try to keep in mind that these two measures are almost the same.
The variance of a random variable is a measure that tells us how much we expect the outcome of said random variable to vary from the expected value. As with the expected value, it is relatively simple to calculate this when we are dealing with simple discrete random variables. Let \(X\) be a discrete random variable with possible outcomes \(x_1, ..., x_n\), and the probability of \(x_i\) is \(P(X = x_i)\). Then the variance of \(X\) is \(\text{Var}(X) = \sum_{i=1}^n P(X = x_i)(x_i - E(X))^2\). At first glance, this can look a bit intimidating, so let’s try to break it down to better understand what’s going on:
- It actually has the form of an expected value, i.e. it is a sum where each term is the product of a value and the probability of the corresponding outcome. So, intuitively, this is not much different than an expected value, it’s just an expected value of something else.
- That “something else” is \((x_i - E(X))^2\). This is representative of the distance from the outcome \(x_i\) to the expected value…
- … except, we square the distance. We do this because we want this measure to be representative of the variation of the data, and so we cannot allow positive and negative differences to cancel. Example: if we didn’t square the differences, a random variable with possible outcomes \(1,2,3\) each with probability \(1/3\) would have variance \(0\), but clearly there is some variation in the sample – not all observations are the same.
So, loosely speaking, the variance is “a measure of the average squared distance from the outcomes to the expected value”.
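As a rough sketch of the definition (the helper names are illustrative), the code below computes the variance of the random variable from the expected value example, with outcomes \(1, 2, 6, 12\), and then shows why the squaring matters for the outcomes \(1, 2, 3\) with probability \(1/3\) each.

```python
def expected_value(dist):
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Probability-weighted average of squared distances to the expected value."""
    mu = expected_value(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())

dist = {1: 0.2, 2: 0.1, 6: 0.6, 12: 0.1}
print(variance(dist))  # about 9.56

uniform = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
mu = expected_value(uniform)
# Without squaring, positive and negative differences cancel out.
unsquared = sum(p * (x - mu) for x, p in uniform.items())
print(unsquared)          # about 0
print(variance(uniform))  # about 2/3
```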
7.2.2.1 Rules for working with variance
Working with the variance of random variables is not quite as simple as working with the expected value. This is due to the fact that the expected value is a simple average, whereas the variance is an average of squared differences. The result is the following set of rules: if \(X\) and \(Y\) are random variables, and \(a\) is some fixed constant, then
- \(\text{Var}(a\cdot X) = a^2 \text{Var}(X)\),
- \(\text{Var}(a) = 0\),
- if \(X\) and \(Y\) are independent: \(\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y)\).
Combining the first and third rules above tells us that, if \(X\) and \(Y\) are independent, then \(\text{Var}(X - Y) = \text{Var}(X + (-Y)) = \text{Var}(X) + \text{Var}(-Y) = \text{Var}(X) + (-1)^2 \text{Var}(Y) = \text{Var}(X) + \text{Var}(Y)\). Don’t forget this!!
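These rules can be checked exactly with two independent fair dice. The dice and the helper names `var` and `combine` are assumptions made for this sketch only. The function `combine` builds the distribution of a function of two independent random variables by multiplying their probabilities.

```python
from fractions import Fraction as F
from itertools import product

die = {k: F(1, 6) for k in range(1, 7)}  # a fair six-sided die

def var(dist):
    mu = sum(x * p for x, p in dist.items())
    return sum(p * (x - mu) ** 2 for x, p in dist.items())

def combine(dx, dy, op):
    """Distribution of op(X, Y) for independent X ~ dx and Y ~ dy."""
    out = {}
    for (x, px), (y, py) in product(dx.items(), dy.items()):
        key = op(x, y)
        out[key] = out.get(key, 0) + px * py
    return out

scaled = {3 * x: p for x, p in die.items()}        # distribution of 3 * X
total = combine(die, die, lambda x, y: x + y)      # distribution of X + Y
diff = combine(die, die, lambda x, y: x - y)       # distribution of X - Y

print(var(scaled) == 9 * var(die))   # Var(aX) = a^2 Var(X): True
print(var({5: F(1)}) == 0)           # Var(a) = 0: True
print(var(total) == 2 * var(die))    # Var(X + Y) = Var(X) + Var(Y): True
print(var(diff) == 2 * var(die))     # Var(X - Y) = Var(X) + Var(Y): True
```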
7.2.2.2 So what about that standard deviation?
So far we’ve talked about the variance, a bit about how to interpret it, and how to work with it for multiple random variables. But what about that other thing mentioned above, the standard deviation?
The standard deviation of a random variable is simply the square root of the variance: \(\text{SD}(X) = \sqrt{\text{Var}(X)}\). As we saw above, the variance has some nice mathematical properties, such as the fact that it is basically an expected value, and that (when \(X\) and \(Y\) are independent) \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\). Neither of these two things is true for the standard deviation, and we lose both because of the square root. However, it is also because of an effect of the square root that we like using the standard deviation in certain situations.
As mentioned, the variance is nice mathematically, but as soon as we make our way back from the beautiful haven that is the Land of Mathematics, and want to communicate our findings to collaborators or the rest of the world, the variance isn’t great. Since we square all the differences, the unit of the variance is the square of whatever unit your original measure was in. Example: we might wish to estimate the height of adults in the SHOW data, and report it with some measure of uncertainty. We find that the average height is 66.470625 inches, and the variance is 22.1376297… \(\text{inches}^2\)? This is hard to really grasp, and the number itself doesn’t mean much to us. Is 22 \(\text{inches}^2\) a lot? We can’t even really compare it to the mean because of the different units! However, the standard deviation fixes just that. It is still a measure of the expected variation, but it has been brought back to the original scale by taking the square root. So when we report a mean height of 66.470625 inches with a standard deviation of 4.7050643 inches, this all of a sudden makes much more sense intuitively.
The moral of the story: both the variance and the standard deviation have a role in the world of statistics, but at different stages. The variance is very useful in the more mathematical parts of the field, while the standard deviation is easier to interpret. Luckily, going from one to the other is simple: \(\text{Var}(X) = \text{SD}(X)^2\) and \(\text{SD}(X) = \sqrt{\text{Var}(X)}\). Therefore, if you ever have one, you practically have both. Don’t forget this, as it is a common mistake to plug in the variance to equations where it should have been the standard deviation.
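A small sketch using the height numbers quoted above shows the back-and-forth between the two measures. The conversion to centimeters is an added illustration, not part of the SHOW analysis: changing units multiplies the standard deviation by \(2.54\) but the variance by \(2.54^2\), in line with \(\text{Var}(a\cdot X) = a^2 \text{Var}(X)\).

```python
import math

var_in2 = 22.1376297          # variance of height, in inches^2 (from the text)
sd_in = math.sqrt(var_in2)    # standard deviation, back on the inch scale
print(round(sd_in, 4))        # 4.7051

# Changing units from inches to centimeters (1 inch = 2.54 cm):
sd_cm = 2.54 * sd_in          # SD scales by a
var_cm2 = 2.54 ** 2 * var_in2 # variance scales by a^2
print(math.isclose(sd_cm ** 2, var_cm2))  # True
```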
7.2.3 Things to remember when working with random variables
When working with random variables, \(X\) and \(Y\), these are the important rules:
- \(E(X + Y) = E(X) + E(Y)\),
- if \(a\) is some fixed number, \(E(a\cdot X) = a\cdot E(X)\),
- if \(a\) is some fixed number, \(\text{Var}(a \cdot X) = a^2 \text{Var}(X)\),
- IF \(X\) and \(Y\) are independent, \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)\),
- IF \(X\) and \(Y\) are independent, \(\text{Var}(X - Y) = \text{Var}(X) + \text{Var}(Y)\).
Things people often forget:
- in general, \(E(X\cdot Y) \neq E(X)E(Y)\) (the two are equal when \(X\) and \(Y\) are independent),
- in general, \(E\left(\frac{X}{Y}\right) \neq \frac{E(X)}{E(Y)}\),
- \(\text{Var}(X+Y) \neq \text{Var}(X) + \text{Var}(Y)\) if \(X\) and \(Y\) are not independent,
- \(\text{Var}(X - Y) \neq \text{Var}(X) - \text{Var}(Y)\),
- \(\text{SD}(X + Y) \neq \text{SD}(X) + \text{SD}(Y)\), even when \(X\) and \(Y\) are independent.
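A tiny counterexample makes some of these pitfalls tangible. In this sketch (the setup is an illustrative assumption), \(X\) is a fair coin coded \(0/1\). For the dependent case we take \(Y\) to be the very same flip as \(X\); for the independent case we take \(Y\) to be a second, separate flip.

```python
from fractions import Fraction as F
import math

coin = {0: F(1, 2), 1: F(1, 2)}  # fair coin coded 0/1

def mean(dist):
    return sum(x * p for x, p in dist.items())

def var(dist):
    mu = mean(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())

# Dependent case: Y is the same flip as X, so X * Y = X^2 and X + Y = 2X.
product_dist = {x * x: p for x, p in coin.items()}
sum_dist = {2 * x: p for x, p in coin.items()}
print(mean(product_dist), mean(coin) * mean(coin))  # 1/2 versus 1/4
print(var(sum_dist), var(coin) + var(coin))         # 1 versus 1/2

# Independent case: variances add, but standard deviations do not.
sd_sum = math.sqrt(var(coin) + var(coin))           # SD(X + Y), about 0.707
print(sd_sum, 2 * math.sqrt(var(coin)))             # about 0.707 versus 1.0
```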
7.3 A Few Important Distributions
7.3.1 The Bernoulli Distribution
In the first example above, we consider flipping a coin. Such an experiment, i.e. one with only two possible outcomes, is often referred to as a Bernoulli experiment, and the random variable \(X\) is referred to as a Bernoulli random variable. The “probability of success” (you get to pick your favorite outcome as a success) is often denoted \(p\). As a shorthand for such a random variable, we write \(X \sim \text{Bernoulli}(p)\), which is read as “\(X\) follows a Bernoulli distribution with probability parameter \(p\)” or “\(X\) is Bernoulli distributed with parameter \(p\)”. Phrases like these can sometimes sound scary and complex, but all it means is that the random variable \(X\) can only take on two different outcomes, and the probability of \(X\) being one of the two outcomes is \(p\), the probability of it being the other is \(1-p\). (Important note: remember that the sum of all probabilities has to be \(1\), so if the probability of one outcome is \(p\), and there are only two possible outcomes, then the probability of the other outcome must be \(1-p\). This way of thinking is something we will use over and over again.)
Using the properties discussed in section 7.2, we can calculate the expected value and variance of a Bernoulli random variable. Simply using the definitions, we see that
\[ E(X) = \sum_{i=1}^2 x_i \cdot P(X = x_i) = 0 \cdot P(X = 0) + 1 \cdot P(X = 1) = p, \]
and
\[ \begin{align} \text{Var}(X) &= \sum_{i=1}^2 P(X = x_i) \cdot (x_i - E(X))^2 \\ &= P(X = 0)\cdot (0 - p)^2 + P(X = 1)\cdot (1 - p)^2 \\ &= (1 - p)\cdot p^2 + p\cdot (1 - p)^2 \\ &= (1-p)\cdot(p^2 + p\cdot(1-p)) \\ &= (1-p)\cdot(p^2 + p - p^2) \\ &= (1-p)\cdot p. \end{align} \]
So it is actually rather simple to find the expected value and variance of a Bernoulli random variable, if we know the probability of success (\(p\)).
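As a quick numerical check of the two results, the sketch below applies the definitions of expected value and variance to a Bernoulli random variable coded \(0/1\) and compares the variance with \(p\cdot(1-p)\). The function name and the values of \(p\) tried are arbitrary.

```python
import math

def bernoulli_mean_var(p):
    """Mean and variance of a 0/1 Bernoulli variable, straight from the definitions."""
    dist = {0: 1 - p, 1: p}
    mean = sum(x * q for x, q in dist.items())
    var = sum(q * (x - mean) ** 2 for x, q in dist.items())
    return mean, var

for p in (0.1, 0.5, 0.9):
    mean, var = bernoulli_mean_var(p)
    # The last column confirms that the variance matches p * (1 - p).
    print(p, mean, var, math.isclose(var, p * (1 - p)))
```

Notice that the variance is largest at \(p = 0.5\) and shrinks as \(p\) moves towards \(0\) or \(1\).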
7.3.2 The Binomial Distribution
Oftentimes we are interested in things that can be viewed as a sum of Bernoulli random variables. Let’s say we have \(n\) independent (i.e. the outcome of one doesn’t say anything about the rest) Bernoulli random variables (\(X_1\), \(X_2\), …, \(X_n\)), all with probability of success \(p\), and are interested in the sum of those \(n\) variables \(Y = X_1 + X_2 + ... + X_n\). For this to make sense, we let \(X_i\) be \(1\) if the corresponding “experiment” is a success, and \(0\) if it is a failure. Now, we can think of the random variable \(Y\) as either (1) the sum of independent Bernoulli random variables, or (2) the number of successes among \(n\) independent trials with binary outcomes. It is this latter interpretation that makes the random variable \(Y\) interesting.
When a random variable is the sum of \(n\) independent Bernoulli random variables all with probability of success \(p\), we say that \(Y\) follows a Binomial distribution with size \(n\) and probability of success \(p\). We write \(Y \sim \text{Binomial}(n,p)\).
Let’s think for a second about what possible values \(Y\) can take. If all \(n\) Bernoulli experiments happen to come out as failures, then all \(X_i\)’s are \(0\)’s, and so \(Y\) will also be \(0\). The other extreme is if all \(n\) Bernoulli experiments are successes, then all \(X_i\)’s are \(1\)’s, and \(Y\) will be the sum of \(n\) \(1\)’s, so \(Y\) will be \(n\). These are simply the two extremes - any number of the \(X_i\)’s can be \(1\)’s, so \(Y\) can end up being any integer between \(0\) and \(n\), both included. The most likely outcomes are the integers closest to \(n \cdot p\), which is the middle of the range when \(p = 0.5\).
Since \(Y\) is simply a sum of very simple random variables, namely Bernoulli random variables, we can with very simple tools dive deeper, and try to explore what the distribution of a Binomial random variable looks like. We can find the expected value and variance, and the probability of all possible outcomes. There are two ways of doing this: (1) do the math, or (2) flip \(n\) coins an infinite number of times and see how often the number of heads is each of the possible outcomes. Let’s start with the latter.
Since it’s impossible to flip \(n\) coins (for what is \(n\)?), we have to pick an actual number. Let’s pick \(10\). Similarly, it’s impossible to flip \(10\) coins an infinite number of times, so let’s just do it a large number of times. What we are about to do is repeat an experiment (flip \(10\) coins) many, many (\(50000\)) times. The first time we perform this experiment, we see T,T,H,H,T,H,H,T,T,H. When we translate this to \(0\) and \(1\), it looks like 0,0,1,1,0,1,1,0,0,1. So, the value of the binomial variable \(Y\) is 5, since this is the number of heads. Rinse and repeat. The results of the experiments are summarized in the table below.
Now we can get a pretty good estimate of the distribution of \(Y\). Recall, the distribution of a random variable is simply the probabilities of each possible outcome. The probability of a particular outcome, say \(Y = 2\), is the long run proportion of experiments that result in that outcome. So, \(P(Y = 2) = \frac{\text{# experiments with } Y = 2}{\text{# experiments}} = \frac{2164}{5\times 10^{4}} = 0.04328\). If we do this for every possible value of \(Y\), we get something that looks like the following:
| y | # experiments with Y = y | Estimated Probability |
|---|---|---|
| 0 | 55 | 0.0011 |
| 1 | 475 | 0.0095 |
| 2 | 2164 | 0.04328 |
| 3 | 5882 | 0.1176 |
| 4 | 10267 | 0.2053 |
| 5 | 12297 | 0.2459 |
| 6 | 10234 | 0.2047 |
| 7 | 5970 | 0.1194 |
| 8 | 2149 | 0.04298 |
| 9 | 465 | 0.0093 |
| 10 | 42 | 0.00084 |
We see that the most probable outcomes are around the middle (4,5,6) with proportions above 0.20.
Another popular way of displaying this is using a histogram:

When viewing this, the probability of a given outcome can be interpreted as the area of the corresponding bar divided by the total area.
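The coin-flipping experiment described above is easy to mimic on a computer. The sketch below is not the code that produced the table (the seed and random number generator are assumptions), so the counts will differ slightly from the ones shown, but the overall pattern should be the same.

```python
import random
from collections import Counter

rng = random.Random(2024)      # arbitrary seed, for repeatability
n_coins, n_experiments, p = 10, 50_000, 0.5

def one_experiment():
    """Flip n_coins coins and return the number of heads (successes)."""
    return sum(rng.random() < p for _ in range(n_coins))

counts = Counter(one_experiment() for _ in range(n_experiments))

# Estimated probability of each outcome = long run proportion of experiments.
for y in range(n_coins + 1):
    print(y, counts[y], counts[y] / n_experiments)
```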
As mentioned earlier, the distribution of a binomial random variable can also be calculated mathematically. We won’t go into the details here, but I will leave you with the formula: \(P(Y = k) = {n \choose k}p^k (1-p)^{n-k}\), where \({n \choose k}\) is the number of ways to choose which \(k\) of the \(n\) trials are the successes. Take a look at the calculated probabilities below, and compare them to the estimates we got by flipping \(10\) coins \(50000\) times.
| y | # experiments with Y = y | Estimated Probability | Probability |
|---|---|---|---|
| 0 | 55 | 0.0011 | 0.0009766 |
| 1 | 475 | 0.0095 | 0.009766 |
| 2 | 2164 | 0.04328 | 0.04395 |
| 3 | 5882 | 0.1176 | 0.1172 |
| 4 | 10267 | 0.2053 | 0.2051 |
| 5 | 12297 | 0.2459 | 0.2461 |
| 6 | 10234 | 0.2047 | 0.2051 |
| 7 | 5970 | 0.1194 | 0.1172 |
| 8 | 2149 | 0.04298 | 0.04395 |
| 9 | 465 | 0.0093 | 0.009766 |
| 10 | 42 | 0.00084 | 0.0009766 |
Pretty close!
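The exact probabilities in the last column can be reproduced directly from the formula. This is a minimal sketch; `binomial_pmf` is a name chosen here, and `math.comb` computes \({n \choose k}\).

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(Y = k) for Y ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

probs = [binomial_pmf(k, 10, 0.5) for k in range(11)]
for k, prob in enumerate(probs):
    print(k, round(prob, 7))

print(sum(probs))  # the probabilities of all possible outcomes add up to 1
```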
As mentioned, the expected value is basically the long run average. So, if we calculate the average of all outcomes of \(Y\) we get a good estimate of what the expected value of \(Y\) is. Similarly, the variance of the outcomes is a good estimate of the variance of the random variable \(Y\). From the data, \(\bar{y} = 4.99986\) and \(s_Y^2 = 2.4838697\). Remember those two numbers.
If we use the rules of expectation and variance from the previous section, we can find the exact expected value and variance of a binomial random variable with size \(n\) and probability of success \(p\). Since \(Y \sim \text{Binomial}(n,p)\) if \(Y = X_1 + ... + X_n\), where \(X_i \sim \text{Bernoulli}(p)\) and all \(X_i\)’s are independent, we have that
\[ \begin{align} E(Y) &= E(X_1 + ... + X_n) && \\ &= E(X_1) + ... + E(X_n) && \\ &= p + ... + p && (E(X_i) = p \text{ since } X_i \sim \text{Bernoulli}(p)) \\ &= n\cdot p, && \end{align} \]
and
\[ \begin{align} \text{Var}(Y) &= \text{Var}(X_1 + ... + X_n) && \\ &= \text{Var}(X_1) + ... + \text{Var}(X_n) && (\text{since all } X_i's \text{ are independent}) \\ &= p\cdot(1-p) + ... + p\cdot(1-p) && (\text{since } X_i \sim \text{Bernoulli}(p)) \\ &= n\cdot p \cdot (1-p). && \end{align} \]
These two equations really emphasize that a Binomial random variable is really just the sum of \(n\) Bernoulli random variables: notice how both the expected value and the variance are \(n\) times those of a single Bernoulli random variable.
Now let’s calculate the expected value and variance of our little experiment. We flip a coin \(10\) times. The probability of success is \(0.5\). So, \(Y \sim \text{Binomial}(10, 0.5)\), and we should have \(E(Y) = 10 \cdot 0.5 = 5\), and \(\text{Var}(Y) = 10 \cdot 0.5 \cdot (1-0.5) = 2.5\). Remember what we got for the expected value and variance? Numbers very close to these.
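We can also double check the two formulas without simulating anything, by applying the definitions of expected value and variance to the exact binomial probabilities. In this sketch the second case, \(n = 20\) and \(p = 0.3\), is an extra made-up example.

```python
from math import comb

def binomial_mean_var(n, p):
    """Mean and variance of Binomial(n, p), computed from the exact probabilities."""
    pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(k * q for k, q in enumerate(pmf))
    var = sum(q * (k - mean) ** 2 for k, q in enumerate(pmf))
    return mean, var

print(binomial_mean_var(10, 0.5))  # close to (5, 2.5), i.e. n*p and n*p*(1-p)
print(binomial_mean_var(20, 0.3))  # close to (6, 4.2)
```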
7.3.3 Normal Distribution
The normal distribution is most definitely the most important distribution we will discuss in this class for one reason: The Central Limit Theorem. We’ll get back to what this is later, but first let us get familiar with the normal distribution.
In contrast to the Bernoulli and Binomial distributions, the normal distribution is a continuous distribution. This means that we cannot specify the probability of every single possible outcome. Instead we specify it using a curve, and we then think about areas under this curve as the probabilities. This curve is what we call the density.
The normal distribution density is specified by two parameters. The first specifies the mean of the distribution (and is therefore called the mean or location parameter). We often use \(\mu\) to denote the mean of a normal distribution, or \(\mu_X\) if we want to really stress that we are talking about the mean of the random variable \(X\). The second parameter specifies the variance of the distribution. We often use \(\sigma^2\) to denote this, or \(\sigma_X^2\). The mean parameter can really be any real number, while the variance has to be positive. If \(X\) follows a normal distribution with mean \(\mu\) and variance \(\sigma^2\), we write \(X \sim N(\mu, \sigma^2)\).
So what does this curve actually look like? It’s a bell curve that is centered at the mean \(\mu\) and the shape/width is controlled by the variance \(\sigma^2\). Below are a few examples. The first figure shows varying means, the second varying variances.


As the names of the parameters suggest, the actual expected value and variance of a random variable that is normally distributed, \(X \sim N(\mu, \sigma^2)\), are simply \(\mu\) and \(\sigma^2\), respectively.
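To see what “probabilities are areas under the density” means in practice, the sketch below approximates areas under the standard normal density with many thin trapezoids. It uses `NormalDist` from Python’s standard `statistics` module, which takes the standard deviation \(\sigma\) rather than the variance. The integration limits and step count are arbitrary choices.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)  # sigma is the standard deviation

def area_under_density(lo, hi, steps=10_000):
    """Trapezoid-rule approximation of the area under the density between lo and hi."""
    width = (hi - lo) / steps
    heights = [dist.pdf(lo + i * width) for i in range(steps + 1)]
    return width * (sum(heights) - 0.5 * (heights[0] + heights[-1]))

print(area_under_density(-8, 8))   # total area: very close to 1
print(area_under_density(-1, 1))   # P(-1 < X < 1): about 0.6827
print(dist.cdf(1) - dist.cdf(-1))  # the same probability from the built-in cdf
```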
7.3.3.1 Linear Combination of Normal with Constant
One really neat property of the normal distribution is that if you add a constant number, \(a\), to a random variable you again get something that is normally distributed. Similarly, if you multiply by a constant you get back something that is still normally distributed. The exact normal distribution can easily be specified (recall: to specify a normal distribution, we need to find the mean and variance). For completeness, let’s do this. If \(X \sim N(\mu, \sigma^2)\), then \(Y_1 = X+a\) and \(Y_2 = a\cdot X\) are also normally distributed. Using the properties of expected value and variance from section 7.2, we get that
\[\begin{aligned} E(Y_1) &= E(X+a) = E(X) + a = \mu + a, \\ E(Y_2) &= E(a\cdot X) = a E(X) = a\cdot \mu, \end{aligned}\]
and
\[ \begin{aligned} \text{Var}(Y_1) &= \text{Var}(X+a) = \text{Var}(X) = \sigma^2, \\ \text{Var}(Y_2) &= \text{Var}(a\cdot X) = a^2 \text{Var}(X) = a^2 \sigma^2. \end{aligned} \]
So \(X + a \sim N(\mu + a, \sigma^2)\), and \(a\cdot X \sim N(a\cdot \mu, a^2 \sigma^2)\).
One particular case of the normal distribution plays an important role in much of statistics, and has therefore been named the Standard Normal Distribution. For historic reasons, we often use \(Z\) to denote the standard normal distribution, which is simply a normal distribution with mean \(0\) and variance \(1\). I.e. \(Z \sim N(0,1)\). One reason why this is important is that it provides sort of a baseline that we can always revert to. Whenever you are working with a normal distribution, you can use the rules above to get the standard normal. If \(X \sim N(\mu, \sigma^2)\), then \(\frac{X-\mu}{\sigma} = Z \sim N(0,1)\). Why? As we just discussed, adding a constant to a normal random variable results in something normal. \(X-\mu\) is simply adding \(-\mu\) to \(X\), so this is still normal. We also saw that multiplying by a constant is still normal, so since \(\frac{X-\mu}{\sigma}\) is simply multiplying \(X-\mu\) by \(\frac{1}{\sigma}\), we have that \(\frac{X-\mu}{\sigma}\) is a normal random variable. We can find its mean and variance using the rules we’ve learned, and get that \(E\left(\frac{X - \mu}{\sigma}\right) = \frac{E(X) - \mu}{\sigma} = 0\), and \(\text{Var}\left(\frac{X - \mu}{\sigma}\right) = \frac{\text{Var}(X)}{\sigma^2} = 1\), so \(\frac{X-\mu}{\sigma} = Z \sim N(0,1)\).
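Standardizing is easy to try out on simulated data. In the sketch below, the values \(\mu = 66\) and \(\sigma = 4\), the seed, and the sample size are arbitrary illustrative choices. Note that `random.gauss` takes the standard deviation, not the variance.

```python
import random
import statistics

rng = random.Random(3)
mu, sigma = 66.0, 4.0   # arbitrary illustrative mean and standard deviation

xs = [rng.gauss(mu, sigma) for _ in range(100_000)]  # realizations of X
zs = [(x - mu) / sigma for x in xs]                  # standardized values

z_mean = statistics.fmean(zs)
z_var = statistics.pvariance(zs)
print(z_mean, z_var)  # close to 0 and 1, respectively
```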
7.3.3.2 Sum of (Independent) Normals
Another really important and useful property of the normal distribution is that if you have two independent normally distributed variables, \(X \sim N(\mu_X, \sigma_X^2)\) and \(Y \sim N(\mu_Y, \sigma_Y^2)\), then the sum of the two, \(X + Y\), is also a normally distributed random variable.
The mean parameter of this newly created random variable is always easy to find: \(E(X + Y) = E(X) + E(Y) = \mu_X + \mu_Y\). The variance is, in general, a bit harder to find, but it is simple when the two are independent of each other. In this case \(\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) = \sigma_X^2 + \sigma_Y^2\). So, if \(X\) and \(Y\) are independent, then \(X + Y \sim N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)\).
Combining this with the rules stated above, we get that \(X - Y\) is normally distributed as well, since \(X - Y = X + (-Y)\), and both \(X\) and \(-Y\) are normally distributed. More applications of the rules give us \(X - Y \sim N(\mu_X - \mu_Y, \sigma_X^2 + \sigma_Y^2)\). (NOTE: we DO NOT subtract the variances.)
7.3.3.3 Some Exploration through Simulations
To illustrate the properties presented in the two previous sections, let us take a look at some simulated data. Let \(X \sim N(-0.5, 1)\) and \(Y \sim N(1, 1.5)\). I.e. \(X\) and \(Y\) follow these two distributions:

Now, let’s say we’re actually interested in \(W = X - Y\). That is, we perform an experiment, observe a realization of \(X\) and \(Y\), and then create a realization of \(W\) as \(w = x - y\). The first experiment results in \(x = 1.5653, y = 0.7306\), and so \(w = x - y = 0.8347\). We repeat this experiment many, many (\(10^{4}\)) times. This enables us to take a look at histograms of the outcomes of \(X\), \(Y\), and \(W\), and we calculate the observed averages and variances so that we can compare with our theoretical expectations.
So, first of all: do \(X\) and \(Y\) actually match the distributions we wanted them to come from? Below are histograms of the outcomes with the distributions overlaid. Notice how closely the histograms follow the curves. It definitely seems that the outcomes of \(X\) and \(Y\) indeed come from the respective normal distributions.

Now, let us take a look at the difference between the two, i.e. \(W\).

A few things to notice:
- it most definitely looks like a new normal distribution
- it seems to be centered not far from \(-1.5\)
- it seems to be wider than both of the other curves
So, do these observations match what we would expect?
- We know that the difference of two normally distributed variables should again be normally distributed
- Our rules tell us that \(E(W) = E(X - Y) = E(X) - E(Y) = -0.5 - 1 = -1.5\), so that also checks out
- The rules stated above also tell us that \(\text{Var}(W) = \text{Var}(X - Y) = \text{Var}(X) + \text{Var}(Y) = 1 + 1.5 = 2.5\), so we do expect the new curve to be wider than both of the old ones.
Finally, we can check that the averages and variances we observe are close to what the theory tells us:
| Variable | Observed Average | Observed Variance | Theoretical Mean | Theoretical Variance |
|---|---|---|---|---|
| X | -0.5061716 | 0.9892788 | -0.5 | 1.0 |
| Y | 0.9963078 | 1.5060952 | 1.0 | 1.5 |
| W | -1.5024794 | 2.4628971 | -1.5 | 2.5 |
Again, only small differences between the observed and the expected.
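A simulation along these lines can be sketched in a few lines of Python. This is not the code behind the table above (the seed and generator are assumptions), so the observed numbers will differ a little. One thing to watch out for: `random.gauss` expects the standard deviation, so the variance \(1.5\) has to be passed as \(\sqrt{1.5}\).

```python
import math
import random
import statistics

rng = random.Random(11)
n = 10_000

xs = [rng.gauss(-0.5, math.sqrt(1.0)) for _ in range(n)]  # X ~ N(-0.5, 1)
ys = [rng.gauss(1.0, math.sqrt(1.5)) for _ in range(n)]   # Y ~ N(1, 1.5)
ws = [x - y for x, y in zip(xs, ys)]                      # W = X - Y

for name, data in (("X", xs), ("Y", ys), ("W", ws)):
    print(name, statistics.fmean(data), statistics.variance(data))
```

The row for \(W\) should show an average near \(-1.5\) and a variance near \(2.5\): the means are subtracted, while the variances are added.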
7.3.4 t-distribution
The t-distribution is very similar to the normal distribution in that the curve also resembles a bell. Unlike the normal distribution, it only depends on one parameter, which is called the degrees of freedom, or \(df\). We use the notation \(t_{df}\) for a t-distribution with \(df\) degrees of freedom.
The t-distribution is always centered around \(0\), which is also its mean (when \(df > 1\)), but the variance depends on the degrees of freedom: if \(X \sim t_{df}\), then \(\text{Var}(X) = \frac{df}{df-2}\) if \(df > 2\), \(\text{Var}(X) = \infty\) if \(1 < df \leq 2\), and the variance of \(X\) is actually undefined if \(df \leq 1\).
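The variance of the t-distribution is easy to tabulate. In the sketch below, `t_variance` is a helper written for this illustration: it returns \(\frac{df}{df-2}\) for \(df > 2\), infinity for \(1 < df \leq 2\), and `nan` (not a number) where the variance is undefined.

```python
import math

def t_variance(df):
    """Variance of a t-distribution with df degrees of freedom."""
    if df > 2:
        return df / (df - 2)
    if df > 1:
        return math.inf   # 1 < df <= 2: infinite variance
    return math.nan       # df <= 1: variance is undefined

for df in (1, 2, 3, 5, 10, 30, 100):
    print(df, t_variance(df))
```

As the degrees of freedom grow, the variance shrinks towards \(1\), which is the variance of the standard normal distribution.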
Below are a few examples of the t-distribution with different degrees of freedom. For comparison, the standard normal is also included. Notice how similar the t-distributions with more than 9 degrees of freedom look, and how they keep getting closer and closer to the standard normal distribution. It can actually be shown that if we had an infinite number of degrees of freedom, then the t-distribution is identical to the standard normal distribution.

7.3.5 Other Distributions
The four distributions above are the ones we’ll consider, but there are many, many more out there. Here are a few examples.
7.3.5.1 Poisson Distribution
The Poisson distribution is often used for counting things, such as the number of patients showing up in a clinic during a specified time period. It is a discrete distribution that only takes non-negative integer values. It depends on only one parameter which is often referred to as the rate parameter. It is displayed below with a few different values of the rate.

For a Poisson distributed random variable with rate parameter \(\lambda\), \(X \sim \text{Poisson}(\lambda)\), it holds that \(E(X) = \text{Var}(X) = \lambda\).
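This text does not state the formula for the Poisson probabilities, so the sketch below assumes the standard one, \(P(X = k) = e^{-\lambda}\lambda^k / k!\) for \(k = 0, 1, 2, ...\). With it, we can check numerically that the mean and the variance are both equal to \(\lambda\). The rate \(\lambda = 4\) and the cutoff at \(k = 59\) are arbitrary; the probabilities beyond the cutoff are negligibly small.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), using the standard formula."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 4.0           # arbitrary illustrative rate
ks = range(60)      # outcomes above 59 have negligible probability for this rate

mean = sum(k * poisson_pmf(k, lam) for k in ks)
var = sum(poisson_pmf(k, lam) * (k - mean) ** 2 for k in ks)
print(mean, var)    # both very close to lam
```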
7.3.5.2 Exponential Distribution
The exponential distribution is often used for wait times. This can be useful if you want to model the wait times in an emergency room, for example. It is a continuous distribution that depends on a single parameter, which is also called the rate parameter.

For a random variable that is exponentially distributed with rate parameter \(\lambda\), \(X \sim \text{Exp}(\lambda)\), it holds that \(E(X) = \frac{1}{\lambda}\), and \(\text{Var}(X) = \frac{1}{\lambda^2}\).
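The two formulas can be checked with a quick simulation. In this sketch, the rate \(\lambda = 2\), the seed, and the sample size are arbitrary. `random.expovariate` from Python’s standard library takes the rate parameter directly.

```python
import random
import statistics

rng = random.Random(5)
lam = 2.0   # arbitrary illustrative rate

waits = [rng.expovariate(lam) for _ in range(100_000)]  # simulated wait times

wait_mean = statistics.fmean(waits)
wait_var = statistics.variance(waits)
print(wait_mean, 1 / lam)       # observed average versus 1 / lambda
print(wait_var, 1 / lam ** 2)   # observed variance versus 1 / lambda^2
```

A larger rate means shorter wait times on average, which matches \(E(X) = \frac{1}{\lambda}\).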